STATS 32 Session 10: A Crash Course in Statistics and Modeling

Kenneth Tay

Oct 24, 2019

Announcements

Project due on 2 Nov (Sat) 23:59:59

Remaining office hours:

10am-12pm, Sequoia Hall Rm 105

Recap of session 9

Agenda for today

A very high-level picture; for technical details, take STATS 60 or STATS 101

Recall: Lists

cars <- list(make = "Honda", 
             models = c("Fit", "CR-V", "Odyssey"), 
             available = c(TRUE, TRUE, TRUE))

Extracting parts of a list

Use [[ ]] or $ notation to extract the value stored under a specific key

cars$make         # no quotation marks
## [1] "Honda"
cars[["models"]]  # remember quotation marks!
## [1] "Fit"     "CR-V"    "Odyssey"

Recall: Data frames are lists!
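A minimal sketch of what "data frames are lists" means in practice. The data frame `df` and its values are made up for illustration; the point is only that the list extraction syntax carries over to columns.

```r
# Toy data frame (values are made up for illustration)
df <- data.frame(make  = c("Honda", "Toyota"),
                 doors = c(4, 2),
                 stringsAsFactors = FALSE)

is.list(df)    # a data frame is a list of equal-length columns
names(df)      # the keys are the column names
df$make        # $ extracts a column, just as it extracts a list element
df[["doors"]]  # [[ ]] works too (remember the quotation marks)
```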

Today’s dataset: Top 100 songs on Spotify

(Source: Spotify)

Tempo by mode: Is there a difference?

Hard to tell from the histograms:

Look at mean tempo for each mode
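One way to compute a mean per group in base R is `tapply()`. The sketch below uses a simulated stand-in called `songs` with assumed column names `mode` and `tempo`; it is not the Spotify dataset, and the numbers mean nothing.

```r
# Simulated stand-in for the dataset (NOT the real Spotify data)
set.seed(32)
songs <- data.frame(
  mode  = rep(c("Major", "Minor"), times = c(6, 4)),
  tempo = round(rnorm(10, mean = 120, sd = 25))
)

# Mean tempo within each mode
mean_tempo <- tapply(songs$tempo, songs$mode, mean)
mean_tempo

# Observed difference in means (Major minus Minor)
diff_means <- unname(mean_tempo["Major"] - mean_tempo["Minor"])
diff_means
```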

Is this difference significant? What do we mean by significance anyway?

Structure of a hypothesis test

  1. Start with a null hypothesis: An assumption about how the data are generated
  2. Based on this assumption, how likely were we to collect data as extreme as what we have?
    • p-value: probability of collecting data at least as extreme as ours (if the null hypothesis is true)
  3. Is the p-value considered low or not?
    • Threshold should depend on the context
    • Typical thresholds: 0.1, 0.05, 0.01
  4. If the p-value is below the threshold, there are 2 possible conclusions:
    • A rare event just happened, or
    • Our assumption in Step 1 was false

Hypothesis test: coin flipping example

I flipped a coin 20 times and it came up heads 16 times. Is my coin biased?

  1. Start with a null hypothesis: Probability of heads \(p = 0.5\)

  2. Based on this assumption, how likely were we to collect data as extreme as what we have?
    • “As extreme”: 16 or more heads, or 4 or fewer heads
    • Probability of collecting data as extreme as ours: 0.0118
  3. Is the p-value considered low or not?

  4. If the p-value is below the threshold, there are 2 possible conclusions:
    • A rare event just happened, or
    • Our assumption in Step 1 was false
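The p-value in the coin example can be reproduced in R. The sketch below adds up the null probabilities of every outcome at least as extreme as 16 heads, then checks the result against the built-in exact test `binom.test()`. The variable names are illustrative.

```r
n     <- 20   # number of flips
heads <- 16   # observed number of heads

# Outcomes at least as extreme as ours: 4 or fewer heads, or 16 or more
extreme <- c(0:4, 16:20)

# Under the null hypothesis (p = 0.5), add up their probabilities
p_manual <- sum(dbinom(extreme, size = n, prob = 0.5))

# Same calculation via the built-in two-sided exact binomial test
p_builtin <- binom.test(heads, n, p = 0.5)$p.value

round(p_manual, 4)
## [1] 0.0118

# Step 3 depends on the chosen threshold
p_manual < 0.05   # TRUE: considered low at the 0.05 threshold
p_manual < 0.01   # FALSE: not considered low at the 0.01 threshold
```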

Tempo by mode: Is there a difference?

Two options:
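One standard way to test whether two group means differ is Welch's two-sample t-test, available in R as `t.test()`. This is a sketch of that technique on simulated data, offered as an assumption about what is useful here rather than as the method used in class; the `songs` data frame and its column names are hypothetical.

```r
# Simulated stand-in for the dataset (NOT the real Spotify data)
set.seed(32)
songs <- data.frame(
  mode  = rep(c("Major", "Minor"), times = c(12, 8)),
  tempo = rnorm(20, mean = 120, sd = 25)
)

# Null hypothesis: mean tempo is the same for both modes
res <- t.test(tempo ~ mode, data = songs)

res$estimate   # the two group means
res$p.value    # compare against your chosen threshold
```

By default `t.test()` does not assume equal variances in the two groups (this is the Welch version of the test).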

What is a model?

Two steps to modeling

Step 1: Identify a family of models which express a generic pattern between your variables of interest.

Possible model family: Linear model, i.e. \(\text{child} = a_1 + a_2 \times \text{parent}\).

Many other possible models: linear without intercept, quadratic, exponential, …

Different models within the linear model family

Each line corresponds to a choice of \(a_1\) and \(a_2\).

Two steps to modeling

Step 2: Find the model in this family that most closely matches your data.

That is, find specific values of \(a_1\) and \(a_2\) which make the model match the data most closely.

What do we mean by “closely matching the data”?

We choose \(a_1\) and \(a_2\) such that some objective function (loss function) is minimized.

Most common objective: Minimize the sum of the squared lengths of the black lines in the figure (the vertical distances between each data point and the fitted line).

(Source: uc-r.github.io)
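To make the objective concrete, the sketch below writes the sum of squares as an R function of \(a_1\) and \(a_2\) and checks that the least-squares line has a smaller loss than nearby lines. The `parent` and `child` values are simulated for illustration, and `sse` is a hypothetical helper name.

```r
# Simulated data (made up for illustration)
set.seed(1)
parent <- runif(50, min = 60, max = 75)
child  <- 24 + 0.65 * parent + rnorm(50, sd = 2)

# Loss function: sum of squared vertical distances between data and line
# a[1] plays the role of a_1 (intercept), a[2] the role of a_2 (slope)
sse <- function(a) sum((child - (a[1] + a[2] * parent))^2)

# For a straight line, the minimizer has a closed form
a2_hat <- cov(parent, child) / var(parent)
a1_hat <- mean(child) - a2_hat * mean(parent)

sse(c(a1_hat, a2_hat))         # loss at the least-squares line
sse(c(a1_hat + 1, a2_hat))     # shifting the line up increases the loss
sse(c(a1_hat, a2_hat + 0.1))   # tilting the line increases the loss
```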

Linear models in R
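A minimal sketch of fitting a linear model with `lm()`. The formula `child ~ parent` reads "model child as a linear function of parent"; the data frame `dat` is simulated for illustration.

```r
# Simulated data (made up for illustration)
set.seed(1)
dat <- data.frame(parent = runif(50, min = 60, max = 75))
dat$child <- 24 + 0.65 * dat$parent + rnorm(50, sd = 2)

# Fit the least-squares line: child = a_1 + a_2 * parent
fit <- lm(child ~ parent, data = dat)

coef(fit)      # "(Intercept)" is a_1, "parent" is a_2
summary(fit)   # estimates, standard errors, p-values

# Predict for a new value of parent
predict(fit, newdata = data.frame(parent = 70))
```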

Models with categorical variables

Consider modeling valence ~ mode.
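With a categorical predictor, `lm()` by default picks one level as the baseline and estimates an offset for each other level. The sketch below uses simulated data and assumes `mode` has the two levels "Major" and "Minor"; the level names and values are assumptions.

```r
# Simulated stand-in for the dataset (NOT the real Spotify data)
set.seed(2)
songs <- data.frame(
  mode    = factor(rep(c("Major", "Minor"), each = 10)),
  valence = runif(20)
)

fit <- lm(valence ~ mode, data = songs)
coef(fit)   # "(Intercept)" and "modeMinor"

# Compare with the group means
group_means <- tapply(songs$valence, songs$mode, mean)
group_means
```

Here the intercept equals the mean valence of the baseline level ("Major", first alphabetically), and `modeMinor` equals the Minor mean minus the Major mean.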

Additive models

Formula valence ~ loudness + mode translates to
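Under R's default coding, and assuming `mode` is a factor with levels "Major" and "Minor" (an assumption about the data), the additive formula corresponds to \(\text{valence} = a_1 + a_2 \times \text{loudness} + a_3 \times [\text{mode is Minor}]\): two parallel lines with a shared slope and different intercepts. A sketch on simulated data:

```r
# Simulated stand-in for the dataset (NOT the real Spotify data)
set.seed(3)
songs <- data.frame(
  loudness = runif(20, min = -12, max = -2),
  mode     = factor(rep(c("Major", "Minor"), times = 10))
)
songs$valence <- 0.8 + 0.03 * songs$loudness +
  0.1 * (songs$mode == "Minor") + rnorm(20, sd = 0.05)

fit_add <- lm(valence ~ loudness + mode, data = songs)

# One intercept, one shared slope, one offset for Minor
names(coef(fit_add))
## [1] "(Intercept)" "loudness"    "modeMinor"
```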

Models with interaction

Formula valence ~ loudness * mode translates to
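In R's formula syntax, `a * b` is shorthand for `a + b + a:b`, where `a:b` is the interaction term. With a two-level `mode` (levels "Major" and "Minor" are assumed here), the interaction lets each mode have its own slope as well as its own intercept. A sketch on simulated data:

```r
# Simulated stand-in for the dataset (NOT the real Spotify data)
set.seed(3)
songs <- data.frame(
  loudness = runif(20, min = -12, max = -2),
  mode     = factor(rep(c("Major", "Minor"), times = 10))
)
songs$valence <- 0.8 + 0.03 * songs$loudness +
  0.1 * (songs$mode == "Minor") + rnorm(20, sd = 0.05)

fit_int  <- lm(valence ~ loudness * mode, data = songs)
fit_long <- lm(valence ~ loudness + mode + loudness:mode, data = songs)

# Both formulas give the same model
names(coef(fit_int))
## [1] "(Intercept)"        "loudness"           "modeMinor"
## [4] "loudness:modeMinor"
all.equal(coef(fit_int), coef(fit_long))
```

The coefficient `loudness:modeMinor` is the difference in slope between Minor and Major songs.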

Summary of the course

Where do we go from here?

Other Stanford courses

Thank you! :)